Sales Performance and Customer Behavior Analysis based on Superstore Dataset¶
Table of Contents¶
Introduction¶
With growing demands and cut-throat competitions in the market, a Superstore Giant is seeking the knowledge in understanding what works best for them. They would like to understand which products, regions, categories and customer segments they should target or avoid.
Purpose¶
Calculate Sales and Profits by Category: To evaluate how different product categories contribute to overall sales and profits. This allows for strategic decisions on product focus and inventory management.
Profit Margin Analysis: To determine the profitability of products by analyzing profit margins. This helps in pricing strategy and cost management.
Sub-Category Analysis: To drill down into product sub-categories for more detailed insights. This assists in understanding specific market trends and consumer preferences.
Trend Analysis: To identify sales and profit trends over time, such as seasonal patterns or year-over-year growth. This information is crucial for forecasting and planning future sales strategies.
Impact of Discounts on Sales and Profits: To assess how discounts affect sales volumes and profit margins. This analysis helps in optimizing discount strategies to maximize revenue without eroding profit margins.
Customer Purchasing Behavior Analysis: To analyze the purchasing patterns and behaviors of different customer segments. This provides insights into customer needs and can inform targeted marketing campaigns.
Customer Loyalty: To assess customer loyalty by analyzing repeat purchase rates and customer retention metrics. This helps in understanding the customer relationship and identifying opportunities to enhance customer loyalty programs.
Word Cloud for Purchased Product Names: To visual representation of frequently purchased product names to identify popular items and trends among customers.
Data Loading¶
import os
from google.colab import drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Category10
from wordcloud import WordCloud
from bokeh.resources import settings
settings.resources = 'inline'
drive.mount('/content/drive')
os.chdir("/content/drive/My Drive/DA/SuperStore")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data = pd.read_csv("Superstore.csv",encoding='latin1')
data.head()
| Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 | 2 | 0.00 | 41.9136 |
| 1 | 2 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 | 3 | 0.00 | 219.5820 |
| 2 | 3 | CA-2016-138688 | 6/12/2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | ... | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.6200 | 2 | 0.00 | 6.8714 |
| 3 | 4 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.0310 |
| 4 | 5 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.3680 | 2 | 0.20 | 2.5164 |
5 rows × 21 columns
# View data information
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9994 entries, 0 to 9993 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Row ID 9994 non-null int64 1 Order ID 9994 non-null object 2 Order Date 9994 non-null object 3 Ship Date 9994 non-null object 4 Ship Mode 9994 non-null object 5 Customer ID 9994 non-null object 6 Customer Name 9994 non-null object 7 Segment 9994 non-null object 8 Country 9994 non-null object 9 City 9994 non-null object 10 State 9994 non-null object 11 Postal Code 9994 non-null int64 12 Region 9994 non-null object 13 Product ID 9994 non-null object 14 Category 9994 non-null object 15 Sub-Category 9994 non-null object 16 Product Name 9994 non-null object 17 Sales 9994 non-null float64 18 Quantity 9994 non-null int64 19 Discount 9994 non-null float64 20 Profit 9994 non-null float64 dtypes: float64(3), int64(3), object(15) memory usage: 1.6+ MB
# Check for missing values
data.isnull().sum()
Row ID 0 Order ID 0 Order Date 0 Ship Date 0 Ship Mode 0 Customer ID 0 Customer Name 0 Segment 0 Country 0 City 0 State 0 Postal Code 0 Region 0 Product ID 0 Category 0 Sub-Category 0 Product Name 0 Sales 0 Quantity 0 Discount 0 Profit 0 dtype: int64
# Display information for all types of features
data.describe(include=['O'])
#count: Number of non-null entries.
#unique: Number of unique values.
#top: Most frequent value.
#freq: Frequency of the most frequent value.
| Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Region | Product ID | Category | Sub-Category | Product Name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 |
| unique | 5009 | 1237 | 1334 | 4 | 793 | 793 | 3 | 1 | 531 | 49 | 4 | 1862 | 3 | 17 | 1850 |
| top | CA-2017-100111 | 9/5/2016 | 12/16/2015 | Standard Class | WB-21850 | William Brown | Consumer | United States | New York City | California | West | OFF-PA-10001970 | Office Supplies | Binders | Staple envelope |
| freq | 14 | 38 | 35 | 5968 | 37 | 37 | 5191 | 9994 | 915 | 2001 | 3203 | 19 | 6026 | 1523 | 48 |
Data Analysis¶
Calculate sales and profits by category¶
# Aggregate sales and profits by category
category_sales = data.groupby('Category')['Sales'].sum()
category_profits = data.groupby('Category')['Profit'].sum()
category_sales
Category Furniture 741999.7953 Office Supplies 719047.0320 Technology 836154.0330 Name: Sales, dtype: float64
# Plotting pie chart for sales by category
plt.figure(figsize=(14, 7))
plt.subplot(1, 2, 1)
category_sales.plot(kind='pie', autopct='%1.1f%%', startangle=140)
plt.title('Sales by Category')
plt.ylabel('')
# Plotting pie chart for profits by category
plt.subplot(1, 2, 2)
category_profits.plot(kind='pie', autopct='%1.1f%%', startangle=140)
plt.title('Profits by Category')
plt.ylabel('')
plt.tight_layout()
plt.show()
Summary: The pie charts provided show the distribution of sales and profits across different product categories.
The Technology category is both the highest in sales and profits, indicating strong market performance and profitability.
While Furniture has a significant share of sales, its contribution to profits is minimal, suggesting lower profit margins or higher costs associated with this category.
Office Supplies performs well in both sales and profits, indicating a balanced contribution to the overall performance.
Profit Margin Analysis¶
# Calculate profit margin
data['Profit Margin'] = data['Profit'] / data['Sales']
# Aggregate profit margins by category
category_profit_margin = data.groupby('Category')['Profit Margin'].mean()
print(category_profit_margin)
Category Furniture 0.038784 Office Supplies 0.138030 Technology 0.156138 Name: Profit Margin, dtype: float64
# Plotting pie chart for profit margin by category
plt.figure(figsize=(7, 7))
category_profit_margin.plot(kind='pie', autopct='%1.1f%%', startangle=140)
plt.title('Profit Margin by Category')
plt.ylabel('')
Text(0, 0.5, '')
Summary:
Technology not only contributes the most to overall profits but also has the highest profit margin. This indicates that it is both a high-revenue and high-profit category.
Office Supplies has a moderate profit margin, making it a balanced category in terms of both revenue and profitability.
Furniture has the lowest profit margin, suggesting that despite its sales, the profits generated from this category are relatively low. This might be due to higher costs associated with furniture or lower pricing strategies.
Sub-Category Analysis¶
# Aggregate sales and profits by sub-category
subcategory_sales = data.groupby('Sub-Category')['Sales'].sum()
subcategory_profits = data.groupby('Sub-Category')['Profit'].sum()
# Create a DataFrame from the aggregated series
df = pd.DataFrame({
'Sales': subcategory_sales,
'Profit': subcategory_profits
})
# Sort DataFrame by Sales for better visualization
df_sorted = df.sort_values(by='Sales', ascending=False)
fig, ax = plt.subplots(figsize=(12, 8))
bar_width = 0.4
# Positions of the bars on x-axis
r = range(len(df_sorted))
# Plot bars for Sales
bars1 = ax.bar(r, df_sorted['Sales'], color='b', width=bar_width, edgecolor='grey', label='Sales')
# Plot bars for Profit
bars2 = ax.bar([i + bar_width for i in r], df_sorted['Profit'], color='r', width=bar_width, edgecolor='grey', label='Profit')
# Set labels and title
ax.set_xlabel('Sub-Category', fontweight='bold')
ax.set_ylabel('Amount', fontweight='bold')
ax.set_xticks([i + bar_width / 2 for i in r])
ax.set_xticklabels(df_sorted.index, rotation=90)
ax.set_title('Total Sales and Profits by Sub-Category')
# Add value labels on bars
for bars in [bars1, bars2]:
for bar in bars:
yval = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2, yval + 500, round(yval, 2), ha='center', va='bottom', fontsize=8)
ax.legend()
plt.tight_layout()
plt.show()
Summary:
Profitable Categories: Electronics such as Phones and Copiers contribute significantly to overall profitability.
Loss-making Categories: Furniture items like Tables, Supplies, and Bookcases are generating losses, requiring further review of pricing or cost structures.
data['Order Date'] = pd.to_datetime(data['Order Date'])
# Set Order Date as index
data.set_index('Order Date', inplace=True)
# Resample data to get monthly sales and profit
monthly_sales = data.resample('M')['Sales'].sum()
monthly_profits = data.resample('M')['Profit'].sum()
# Create a ColumnDataSource
source = ColumnDataSource(data={
'date': monthly_sales.index,
'sales': monthly_sales.values,
'profits': monthly_profits.values
})
# Create the figure
p = figure(height=400, width=800, x_axis_type='datetime', title='Monthly Sales and Profits')
# Plotting monthly sales
p.line('date', 'sales', source=source, line_width=2, line_color=Category10[3][0], legend_label='Monthly Sales')
p.circle('date', 'sales', source=source, size=8, color=Category10[3][0], legend_label='Monthly Sales')
# Plotting monthly profits
p.line('date', 'profits', source=source, line_width=2, line_color=Category10[3][1], legend_label='Monthly Profits')
p.circle('date', 'profits', source=source, size=8, color=Category10[3][1], legend_label='Monthly Profits')
# Adding HoverTool with tooltips
hover = HoverTool(tooltips=[
('Date', '@date{%F}'),
('Sales', '@sales{$0.00}'),
('Profits', '@profits{$0.00}')
], formatters={'@date': 'datetime'})
p.add_tools(hover)
# Styling and layout adjustments
p.legend.location = 'top_left'
p.legend.click_policy = 'hide'
output_notebook()
# Show the plot
show(p)
Summery:
Sales trends with notable seasonal patterns, including peak sales in late-year months like September and November.
Profitability fluctuates monthly, sometimes showing negative margins despite high sales.
Yearly trends show overall growth, with 2017 notably peaking in November.
Impact of Discounts on Sales and Profits¶
discountsSalesProfit = data[['Discount', 'Sales', 'Profit']]
discountsSalesProfit = discountsSalesProfit.groupby(['Discount']).mean().reset_index()
# Extract data for plotting
discount = discountsSalesProfit['Discount']
sales_mean = discountsSalesProfit['Sales']
profits_mean = discountsSalesProfit['Profit']
# Plotting
plt.figure(figsize=(10, 6))
# Plotting Sales
plt.plot(discount, sales_mean, marker='o', linestyle='-', color='blue', label='Average Sales')
# Plotting Profits
plt.plot(discount, profits_mean, marker='o', linestyle='-', color='green', label='Average Profits')
# Adding labels and title
plt.title('Average Sales and Profits vs. Discount')
plt.xlabel('Discount')
plt.ylabel('Amount ($)')
plt.grid(True)
plt.legend()
# Display plot
plt.show()
Customer Purchasing Behavior Analysis¶
# purchasing behavior by region
region_sales = data.groupby('Region')['Sales'].sum()
region_profits = data.groupby('Region')['Profit'].sum()
print(region_sales)
print(region_profits)
Region Central 501239.8908 East 678781.2400 South 391721.9050 West 725457.8245 Name: Sales, dtype: float64 Region Central 39706.3625 East 91522.7800 South 46749.4303 West 108418.4489 Name: Profit, dtype: float64
# Plotting Sales
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1) # Subplot for Sales
plt.pie(region_sales, labels=region_sales.index, autopct='%1.1f%%', startangle=140)
plt.title('Sales Distribution by Region')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
# Plotting Profits
plt.subplot(1, 2, 2) # Subplot for Profits
plt.pie(region_profits, labels=region_profits.index, autopct='%1.1f%%', startangle=140)
plt.title('Profit Distribution by Region')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()
#purchasing behavior by customer segment
segment_sales = data.groupby('Segment')['Sales'].sum()
segment_profits = data.groupby('Segment')['Profit'].sum()
print(segment_sales)
print(segment_profits)
Segment Consumer 1.161401e+06 Corporate 7.061464e+05 Home Office 4.296531e+05 Name: Sales, dtype: float64 Segment Consumer 134119.2092 Corporate 91979.1340 Home Office 60298.6785 Name: Profit, dtype: float64
# Plotting Sales
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1) # Subplot for Sales
plt.pie(segment_sales, labels=segment_sales.index, autopct='%1.1f%%', startangle=140)
plt.title('Sales Distribution by Segment')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
# Plotting Profits
plt.subplot(1, 2, 2) # Subplot for Profits
plt.pie(segment_profits, labels=segment_profits.index, autopct='%1.1f%%', startangle=140)
plt.title('Profit Distribution by Segment')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()
Customer loyalty¶
# Counting orders per customer
loyalty_counts = data.groupby('Customer ID')['Order ID'].count()
loyalty_counts
Customer ID
AA-10315 11
AA-10375 15
AA-10480 12
AA-10645 18
AB-10015 6
..
XP-21865 28
YC-21895 8
YS-21880 12
ZC-21910 31
ZD-21925 9
Name: Order ID, Length: 793, dtype: int64
# Plotting
plt.figure(figsize=(10, 6))
plt.hist(loyalty_counts, bins=range(1, loyalty_counts.max() + 2), align='left', edgecolor='black')
plt.xlabel('Number of Orders')
plt.ylabel('Number of Customers')
plt.title('Customer Loyalty')
plt.xticks(range(1, loyalty_counts.max() + 1))
plt.grid(axis='y', alpha=0.75)
plt.show()
Word Cloud for Purchased Product Names¶
# Combine the product names into a single string
text = ' '.join(data['Product Name'])
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Summary: The word cloud summarizes the dataset by highlighting frequently occurring terms, which can be useful for identifying popular product features at a glance.